{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "XOM4j723Kxc2"
   },
   "source": [
    "# Applying Contextual Bandits for Recommendation Systems Using TensorFlow and Cloud Storage\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sLrwp8njKxc4"
   },
   "source": [
     "## Learning objectives\n",
     "\n",
     "1. Install and import required libraries.\n",
     "2. Initialize and configure the MovieLens Environment.\n",
     "3. Initialize the Agent.\n",
     "4. Define and link the evaluation metrics.\n",
     "5. Initialize & configure the Replay Buffer.\n",
     "6. Set up and train the model.\n",
     "7. Run inference with the trained model and evaluate with TensorBoard."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ifdzP2rRKxc5"
   },
   "source": [
    "## Introduction\n",
    "\n",
    "\n",
    "Multi-Armed Bandit (MAB) is a machine learning framework in which an agent has to select actions (arms) in order to maximize its cumulative reward in the long term. In each round, the agent receives some information about the current state (context), then chooses an action based on this information and the experience gathered in previous rounds. At the end of each round, the agent receives the reward associated with the chosen action.\n",
    "\n",
    "\n",
    "https://www.tensorflow.org/agents/tutorials/intro_bandit#multi-armed_bandits_and_reinforcement_learning\n",
    "\n",
    "Each learning objective corresponds to a _#TODO_ in this student lab notebook. Try to complete this notebook first and then review the [solution notebook](../solutions/exercise_movielens_notebook.ipynb)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "vpEsA3xmKxc6"
   },
   "source": [
    "## 1. Initial Setup: installing and importing required Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!rm -rf /opt/conda/lib/python3.9/site-packages/typing_extensions-4.4.0.dist-info\n",
    "!rm -rf /opt/conda/lib/python3.9/site-packages/six-1.16.0.dist-info"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "fFMdfJoEKxc6",
    "outputId": "908f0503-10a1-4961-ecbe-1b8f852f322d"
   },
   "outputs": [],
   "source": [
    "!pip install --user --quiet --upgrade --force-reinstall tensorflow tensorflow_probability tensorflow-io\n",
    "!pip install --quiet --upgrade tf_agents gast"
   ]
  },
 {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install --user --quiet --upgrade  gast"
   ]
  },
 {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install tensorflow==2.15.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "WttMqkUYKxc9"
   },
   "outputs": [],
   "source": [
    "import functools\n",
    "import os\n",
    "from absl import app\n",
    "from absl import flags\n",
    "\n",
    "import tensorflow as tf  # pylint: disable=g-explicit-tensorflow-version-import\n",
    "from tf_agents.bandits.agents import dropout_thompson_sampling_agent as dropout_ts_agent\n",
    "from tf_agents.bandits.agents import lin_ucb_agent\n",
    "from tf_agents.bandits.agents import linear_thompson_sampling_agent as lin_ts_agent\n",
    "from tf_agents.bandits.agents import neural_epsilon_greedy_agent as eps_greedy_agent\n",
    "from tf_agents.bandits.agents.examples.v2 import trainer\n",
    "from tf_agents.bandits.environments import environment_utilities\n",
    "#from tf_agents.bandits.environments import movielens_per_arm_py_environment\n",
    "from tf_agents.bandits.environments import movielens_py_environment\n",
    "from tf_agents.metrics import tf_metrics\n",
    "from tf_agents.bandits.metrics import tf_metrics as tf_bandit_metrics\n",
    "from tf_agents.bandits.networks import global_and_arm_feature_network\n",
    "from tf_agents.environments import tf_py_environment\n",
    "from tf_agents.networks import q_network\n",
    "from tf_agents.drivers import dynamic_step_driver\n",
    "from tf_agents.eval import metric_utils\n",
    "from tf_agents.policies import policy_saver\n",
    "from tf_agents.replay_buffers import tf_uniform_replay_buffer\n",
    "from tf_agents.trajectories import time_step as ts\n",
    "\n",
    "# If there are version or incompatibility errors, make sure you restarted the kernel, then run !pip freeze in a new cell to check whether the correct TF and tf_agents versions have been installed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create the target directory if it doesn't exist\n",
    "from datetime import date\n",
    "today = date.today()\n",
    "fdate = today.strftime('%d_%m_%Y')\n",
    "\n",
    "root_path = os.getcwd()\n",
    "log_path = \"{}/{}\".format(root_path, fdate)\n",
    "if not os.path.exists(log_path):\n",
    "    os.mkdir(log_path)\n",
    "    print(\"Directory {} Created\".format(fdate))\n",
    "else:    \n",
    "    print(\"Directory {} already exists\".format(fdate))\n",
    "\n",
    "print(\"Full path is {}\".format(log_path))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "do58umOOKxdL"
   },
   "source": [
    "## 2. Initializing and configuring the MovieLens Environment\n",
    "\n",
    "First, we need to load the movielens.data CSV file stored in Cloud Storage and initialize the `MovieLensPyEnvironment` with it. Refer [here](https://www.tensorflow.org/agents/api_docs/python/tf_agents/bandits/environments/movielens_py_environment/MovieLensPyEnvironment) for guidance.\n",
    "\n",
    "An environment in the TF-Agents Bandits library is a class that provides observations and reports rewards based on observations and actions.\n",
    "\n",
    "We will be using the MovieLens environment. This environment implements the MovieLens 100K dataset, available at:\n",
    "  https://www.kaggle.com/prajitdatta/movielens-100k-dataset\n",
    "\n",
    "This dataset contains 100K ratings from `m=943` users on `n=1682` items. The ratings can be organized as a matrix `A` of size `m`-by-`n`.\n",
    "\n",
    "Note that the ratings matrix is a <b>sparse matrix</b> i.e., only a subset of certain (user, movie) pairs is provided, since not all users have seen all movies.\n",
    "In order for the environment to be able to compute a reasonable estimate of the reward, which represents how much a user `i` would enjoy a movie `j`,\n",
    "the environment computes a dense approximation to this sparse matrix `A`.\n",
    "In collaborative filtering, it is common practice to obtain this dense approximation by means of a low-rank matrix factorization of the matrix A.\n",
    "\n",
    "The MovieLens environment uses truncated Singular Value Decomposition (SVD), although other matrix factorization techniques could also be used.\n",
    "With truncated SVD of rank `k`, the matrix `A` is factorized as follows:\n",
    "$A_k = U_k \\Sigma_k V_k^T$,\n",
    "where:\n",
    "<li>$U_k$ is a matrix of orthogonal columns of size $m$-by-$k$,</li>\n",
    "<li>$V_k$ is a matrix of orthogonal columns of size $n$-by-$k$</li>\n",
    "<li> $\\Sigma_k$ is a diagonal matrix of size $k$-by-$k$ that holds the $k$ largest singular values of A.</li>\n",
    "\n",
    "\n",
    "By splitting $\\Sigma_k$ into $\\sqrt{\\Sigma_k} \\sqrt{\\Sigma_k}$, we can finally approximate the matrix `A` as a \n",
    "product of two factors $\\tilde{U}$ and $\\tilde{V}$, i.e.,\n",
    "\n",
    "$A \\approx \\tilde{U} \\tilde{V}^T$,\n",
    "where $\\tilde{U} = U_k \\sqrt{\\Sigma_k}$ and $\\tilde{V} = V_k \\sqrt{\\Sigma_k}$.\n",
    "\n",
    "Once the matrix factorization has been computed, the environment caches it and uses it to compute the reward for recommending a movie `j` to a user `i` \n",
    "by retrieving the (`i`, `j`)-entry of the approximation $\\tilde{U} \\tilde{V}^T$.\n",
    "\n",
    "Apart from computing the reward when the agent recommends a certain movie to a user, the environment is also responsible for generating the observations that are given as input to the agent so that it can make an informed decision. To generate a random observation, the environment samples a random row `i` from the matrix $\\tilde{U}$. Once the agent selects movie `j`, the environment responds with the (`i`, `j`)-entry of the approximation $\\tilde{U} \\tilde{V}^T$."
   ]
  },
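   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "The observation and reward lookup described above can be sketched in a few lines of plain Python. This is only an illustration, not the TF-Agents implementation: the factor values in `U_TILDE` and `V_TILDE` are made up (3 users, 4 movies, rank 2), and the helper names `estimated_rating` and `best_movie` are hypothetical. In the real environment the factors come from the truncated SVD of the ratings matrix.\n",
     "\n",
     "```python\n",
     "# Hypothetical factors for 3 users and 4 movies with rank k = 2.\n",
     "# In the real environment these come from a truncated SVD of the ratings matrix.\n",
     "U_TILDE = [\n",
     "    [1.0, 0.5],\n",
     "    [0.2, 1.5],\n",
     "    [0.8, 0.8],\n",
     "]\n",
     "V_TILDE = [\n",
     "    [2.0, 0.0],\n",
     "    [0.0, 2.0],\n",
     "    [1.0, 1.0],\n",
     "    [0.5, 2.5],\n",
     "]\n",
     "\n",
     "\n",
     "def estimated_rating(user, movie):\n",
     "    # Entry (user, movie) of U_TILDE times V_TILDE transposed: a dot product of two rows.\n",
     "    return sum(u * v for u, v in zip(U_TILDE[user], V_TILDE[movie]))\n",
     "\n",
     "\n",
     "def best_movie(user):\n",
     "    # Index of the movie with the highest estimated rating for this user.\n",
     "    return max(range(len(V_TILDE)), key=lambda movie: estimated_rating(user, movie))\n",
     "\n",
     "\n",
     "user = 1\n",
     "observation = U_TILDE[user]          # the context the agent sees\n",
     "reward = estimated_rating(user, 2)   # the reward for recommending movie 2\n",
     "print(observation, reward, best_movie(user))\n",
     "```\n",
     "\n",
     "The agent never sees `V_TILDE`; it only observes a user row and the reward of the movie it picked, which is why it has to learn the value of each arm from experience."
    ]
   },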
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "F-bx9j4XZj1M"
   },
   "outputs": [],
   "source": [
    "# initialize the movielens pyenvironment with default parameters\n",
    "NUM_ACTIONS = None # take this as 20\n",
    "RANK_K = None # take rank as 20\n",
    "BATCH_SIZE = None # take batch size as 8\n",
    "data_path = \"gs://ta-reinforecement-learning/dataset/movielens.data\" # specify the path to the movielens.data OR get it from the GCS bucket\n",
    "#TODO: replace the data path if needed\n",
    "env = movielens_py_environment.MovieLensPyEnvironment(\n",
    "        data_path, RANK_K, BATCH_SIZE, num_movies=NUM_ACTIONS)\n",
    "environment = tf_py_environment.TFPyEnvironment(env)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Cv5svfBfKxdU"
   },
   "source": [
    "## 3. Initializing the Agent\n",
    "Now that we have the environment, we can define and initialize the policy and the agent that will use that policy to make decisions given an observation. TF-Agents offers several bandit agents; this lab uses the following one:\n",
    "\n",
    "   1. [NeuralEpsilonGreedyAgent](https://medium.com/analytics-vidhya/the-epsilon-greedy-algorithm-for-reinforcement-learning-5fe6f96dc870): The neural epsilon-greedy algorithm makes a value estimate for all the arms, then chooses the best arm with probability (1 - epsilon) and a random arm with probability epsilon. This balances the exploration-exploitation tradeoff; epsilon is usually set to a small value such as 10%. Example: suppose there are seven arms and epsilon is set to 10%. Then 90% of the time the agent chooses the arm with the highest value estimate (exploiting the arm it currently believes is best), and 10% of the time it chooses a random arm from all seven (thus exploring the other possibilities). Refer [here](https://www.tensorflow.org/agents/api_docs/python/tf_agents/bandits/agents/neural_epsilon_greedy_agent/NeuralEpsilonGreedyAgent) for more information on the TF-Agents version of this agent.\n",
    "\n",
    "    \n",
    "   Each agent is initialized with a policy, which is essentially the function approximator (linear or non-linear) for estimating the Q-values. The agent trains this policy, and the policy adds the exploration-exploitation component on top of it and chooses the action. In this example we use a deep Q-network as our value function, with epsilon-greedy on top of it to select the actions. The action space has size 20 for 20 movies, and the contextual state vector is the dense user vector from the matrix decomposition. In applied situations, a dictionary mapping from a user id to its dense representation could be built to make this more convenient for the end user.\n",
    "\n",
    "   - Step 1. Initialize the Q-network, which takes in the state and returns the value estimate for each action. Set the fully connected layer parameters to `(50, 50, 50)`, from left to right.\n",
    "   - Step 2. Create a neural epsilon-greedy agent with an Adam optimizer, an **epsilon exploration value** of `0.05`, a **learning rate** of `0.005`, and a **dropout rate** of `0.2`. Feel free to experiment with these values later to gauge their impact on training.\n",
    "   \n",
    "Click [here](https://www.tensorflow.org/agents/tutorials/1_dqn_tutorial#agent) for a code example of how to create a Q-network and a DQN agent.\n",
    " "
   ]
  },
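   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "To make the epsilon-greedy rule concrete, here is a minimal plain-Python sketch of the action selection alone. It assumes the value estimates are already given (in the lab they come from the Q-network); the function name `epsilon_greedy` and the numbers in `estimates` are made up for illustration. Note that the exploring branch picks uniformly among all arms, so it can also land on the best one.\n",
     "\n",
     "```python\n",
     "import random\n",
     "\n",
     "\n",
     "def epsilon_greedy(value_estimates, epsilon, rng):\n",
     "    # With probability epsilon explore: pick any arm uniformly at random.\n",
     "    if rng.random() < epsilon:\n",
     "        return rng.randrange(len(value_estimates))\n",
     "    # Otherwise exploit: pick the arm with the highest estimated value.\n",
     "    return max(range(len(value_estimates)), key=lambda arm: value_estimates[arm])\n",
     "\n",
     "\n",
     "rng = random.Random(0)\n",
     "estimates = [0.1, 0.7, 0.3, 0.2]  # made-up value estimates for 4 arms\n",
     "choices = [epsilon_greedy(estimates, 0.1, rng) for _ in range(1000)]\n",
     "print('share of greedy choices:', choices.count(1) / len(choices))\n",
     "```\n",
     "\n",
     "Raising `epsilon` spends more rounds on exploration at the cost of short-term reward; lowering it does the opposite."
    ]
   },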
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "szHbMXyhZj1N"
   },
   "outputs": [],
   "source": [
    "# Replace these values by reading the above instructions carefully\n",
    "EPSILON = 0\n",
    "LAYERS = None\n",
    "LR = 0\n",
    "DROPOUT_RATE = 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "_fGaJF6mKxdV"
   },
   "outputs": [],
   "source": [
    "# Initialize the Qnetwork\n",
    "network = q_network.QNetwork(\n",
    "          input_tensor_spec=environment.time_step_spec().observation,\n",
    "          action_spec=environment.action_spec(),\n",
    "          fc_layer_params=LAYERS)\n",
    "\n",
    "# Create a neural epsilon-greedy agent with an optimizer and an\n",
    "# epsilon exploration value.\n",
    "# Replace all the `None` values with the required values\n",
    "agent = eps_greedy_agent.NeuralEpsilonGreedyAgent(\n",
    "  time_step_spec=None,  # the time step spec of the environment\n",
    "  action_spec=None,  # the action spec of the environment\n",
    "  reward_network=None,  # the Q-network goes here\n",
    "  optimizer=None,  # an Adam optimizer with learning rate LR\n",
    "  epsilon=EPSILON)  # the exploration value set above"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "CYsBS9eRKxdX"
   },
   "source": [
    "## 4. Define and link the evaluation metrics\n",
    "\n",
    "\n",
    "Just like you have metrics like accuracy/recall in supervised learning, in bandits we use the [regret](https://www.tensorflow.org/agents/tutorials/bandits_tutorial#regret_metric) metric per episode. To calculate the regret, we need to know what the highest possible expected reward is in every time step. For that, we define the `optimal_reward_fn`.\n",
    "\n",
    "Another similar metric is the number of times a suboptimal action was chosen. That requires the definition of the `optimal_action_fn`."
   ]
  },
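   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "As a rough illustration of what these two metrics measure, the sketch below computes them by hand on made-up data for four rounds. It is a plain-Python sketch, not the TF-Agents implementation, and the helper names `regret` and `suboptimal_pulls` are hypothetical.\n",
     "\n",
     "```python\n",
     "def regret(optimal_rewards, received_rewards):\n",
     "    # Per-step regret: best achievable expected reward minus the reward obtained.\n",
     "    return [best - got for best, got in zip(optimal_rewards, received_rewards)]\n",
     "\n",
     "\n",
     "def suboptimal_pulls(optimal_actions, chosen_actions):\n",
     "    # Number of steps in which the chosen arm was not the optimal one.\n",
     "    return sum(1 for best, chosen in zip(optimal_actions, chosen_actions) if best != chosen)\n",
     "\n",
     "\n",
     "# Made-up values for four rounds.\n",
     "optimal_rewards = [4.0, 3.5, 5.0, 4.5]\n",
     "received_rewards = [4.0, 2.5, 5.0, 3.0]\n",
     "optimal_actions = [3, 0, 7, 2]\n",
     "chosen_actions = [3, 5, 7, 1]\n",
     "\n",
     "per_step = regret(optimal_rewards, received_rewards)\n",
     "print(per_step, sum(per_step), suboptimal_pulls(optimal_actions, chosen_actions))\n",
     "```\n",
     "\n",
     "A well-trained agent should drive both numbers towards zero over time, apart from the rounds it deliberately spends exploring."
    ]
   },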
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "-l0e78osKxdX"
   },
   "outputs": [],
   "source": [
    "# Functions for computing the optimal reward/action. functools.partial binds the environment so it doesn't need to be passed with every invocation\n",
    "optimal_reward_fn = functools.partial(\n",
    "      environment_utilities.compute_optimal_reward_with_movielens_environment,\n",
    "      environment=environment)\n",
    "\n",
    "optimal_action_fn = functools.partial(\n",
    "      environment_utilities.compute_optimal_action_with_movielens_environment,\n",
    "      environment=environment)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "yXSX61X5Kxda"
   },
   "outputs": [],
   "source": [
    "# Initialize the regret and suboptimal arms metrics using the optimal reward and action functions\n",
    "regret_metric = tf_bandit_metrics.RegretMetric(optimal_reward_fn)\n",
    "suboptimal_arms_metric = tf_bandit_metrics.SuboptimalArmsMetric(\n",
    "      optimal_action_fn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "DynX8nIBKxdc"
   },
   "outputs": [],
   "source": [
    "step_metric = tf_metrics.EnvironmentSteps()\n",
    "metrics = [tf_metrics.NumberOfEpisodes(),  #equivalent to number of steps in bandits problem\n",
    "           regret_metric,  # measures regret\n",
    "           suboptimal_arms_metric,  # number of times the suboptimal arms are pulled\n",
    "           tf_metrics.AverageReturnMetric(batch_size=environment.batch_size)  # the average return\n",
    "           ]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Qi5VVgkeKxde"
   },
   "source": [
    "## 5. Initialize & configure the Replay Buffer\n",
    "Reinforcement learning algorithms use replay buffers to store trajectories of experience when executing a policy in an environment. During training, replay buffers are queried for a subset of the trajectories (either a sequential subset or a sample) to \"replay\" the agent's experience. Sampling from the replay buffer facilitates data reuse and breaks harmful correlation between sequential data in RL. In contextual bandits this isn't strictly required, but it is still helpful.\n",
    "\n",
    "The replay buffer exposes several functions that allow you to manipulate it in several ways. Read more on them [here](https://www.tensorflow.org/agents/tutorials/5_replay_buffers_tutorial).\n",
    "\n",
    "In this demo we use the `TFUniformReplayBuffer`. To create it we need the spec of the trajectory of the agent's policy, a chosen batch size (the number of trajectories to store), and the maximum length of the trajectory (the number of sequential time steps that are considered one data point). For example, a batch of 3 with 2 time steps each results in a tensor of shape (3, 2). Unlike regular RL problems, contextual bandits have only one time step, so we could keep `max_length = 1`; however, since this tutorial is also meant to prepare you for RL problems, let's set it to 2. Do not worry: any contextual bandit agent will internally split the time steps inside each data point, so that the effective batch shape ends up being (6, 1)."
   ]
  },
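   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "The shape bookkeeping in the paragraph above can be illustrated without any RL library. The sketch below uses nested Python lists as a stand-in for the stored tensors: each placeholder item is a `(row, step)` tuple, and the helper `flatten_time` is a hypothetical name for the step that folds the time axis into the batch axis. It shows only the reshaping idea, not how TF-Agents does it internally.\n",
     "\n",
     "```python\n",
     "# Hypothetical stored data: 3 buffer rows, each holding 2 consecutive time steps.\n",
     "BATCH, TIME = 3, 2\n",
     "buffer_rows = [[(row, step) for step in range(TIME)] for row in range(BATCH)]\n",
     "\n",
     "\n",
     "def flatten_time(rows):\n",
     "    # Fold the time axis into the batch axis: (batch, time) -> (batch * time, 1).\n",
     "    return [[item] for row in rows for item in row]\n",
     "\n",
     "\n",
     "flat = flatten_time(buffer_rows)\n",
     "print(len(buffer_rows), len(buffer_rows[0]))  # outer shape before flattening\n",
     "print(len(flat), len(flat[0]))                # outer shape after flattening\n",
     "```\n",
     "\n",
     "Because every bandit round is independent of the previous one, treating each time step as its own data point loses no information."
    ]
   },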
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "4P0uSb9eKxdf"
   },
   "source": [
    "Create a TensorFlow-based `TFUniformReplayBuffer` and initialize it with appropriate values.\n",
    "\n",
    "Recommended:\n",
    "- **Batch size** = `8`\n",
    "- **Max length** = `2` (2 time steps per item)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "WcsLSlfDKxdf"
   },
   "outputs": [],
   "source": [
    "#TODO\n",
    "\n",
    "STEPS_PER_LOOP = None\n",
    "\n",
    "# TFUniformReplayBuffer is the most commonly used replay buffer in TF-Agents. Use 'tf_uniform_replay_buffer.TFUniformReplayBuffer' to create one.\n",
    "buf = None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "RYYbzaPuKxdh"
   },
   "source": [
    "Now we have a replay buffer, but we also need something to fill it with. A common practice is to have \n",
    "the agent interact with the environment and collect experience without actually learning from it (i.e., forward pass only). You can either write this loop manually, as shown [here](https://www.tensorflow.org/agents/tutorials/6_reinforce_tutorial#training_the_agent), or use the `DynamicStepDriver`.\n",
    "The data encountered by the driver at each step is saved in a named tuple called Trajectory and broadcast to a set of observers such as replay buffers and metrics. \n",
    "This Trajectory includes the observation from the environment, the action recommended by the policy, the reward obtained, the type of the current and the next step, etc. \n",
    "\n",
    "In order for the driver to fill the replay buffer with data, as well as to compute ongoing metrics, it needs access to the `add_batch` functionality of the buffer and to the metrics (both step and regular). Refer [here](https://www.tensorflow.org/agents/tutorials/5_replay_buffers_tutorial#data_collection) for more information and example code on how to initialize a step driver with observers. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "-yUkG51PKxdh"
   },
   "outputs": [],
   "source": [
    "#TODO: setup the replay observer as a list to capture both metrics, step metrics and provide access to the function to load data from the driver into the buffer\n",
    "replay_observer = None\n",
    "\n",
    "driver = dynamic_step_driver.DynamicStepDriver(\n",
    "      env=None,\n",
    "      policy=None,\n",
    "      num_steps=STEPS_PER_LOOP * environment.batch_size,\n",
    "      observers=None)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BUlBfDUiZj1P"
   },
   "source": [
    "## 6. Set Up and Train the Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "leUtw6DDKxdk"
   },
   "source": [
    "Here we provide a helper function to save your agent, the metrics, and the agent's lighter policy separately while training the model. We turn all of these into trackable objects and then use a checkpoint to save them, as well as to warm-restart a previous training run. For more information on checkpoints and policy savers (which are used in the training loop below), refer [here](https://www.tensorflow.org/agents/tutorials/10_checkpointer_policysaver_tutorial)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "kqUY-HRKZj1Q"
   },
   "outputs": [],
   "source": [
    "AGENT_CHECKPOINT_NAME = 'agent'\n",
    "STEP_CHECKPOINT_NAME = 'step'\n",
    "CHECKPOINT_FILE_PREFIX = 'ckpt'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "mWELf3aPKxdk"
   },
   "outputs": [],
   "source": [
    "def restore_and_get_checkpoint_manager(root_dir, agent, metrics, step_metric):\n",
    "    \"\"\"Restores from `root_dir` and returns the checkpoint manager.\"\"\"\n",
    "    trackable_objects = {metric.name: metric for metric in metrics}\n",
    "    trackable_objects[AGENT_CHECKPOINT_NAME] = agent\n",
    "    trackable_objects[STEP_CHECKPOINT_NAME] = step_metric\n",
    "    checkpoint = tf.train.Checkpoint(**trackable_objects)\n",
    "    checkpoint_manager = tf.train.CheckpointManager(checkpoint=checkpoint,\n",
    "                                                  directory=root_dir,\n",
    "                                                  max_to_keep=5)\n",
    "    latest = checkpoint_manager.latest_checkpoint\n",
    "\n",
    "    if latest is not None:\n",
    "        print('Restoring checkpoint from %s.' % latest)\n",
    "        checkpoint.restore(latest)\n",
    "        print('Successfully restored to step %s.' % step_metric.result())\n",
    "    else:\n",
    "        print('Did not find a pre-existing checkpoint. '\n",
    "                 'Starting from scratch.')\n",
    "    return checkpoint_manager\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "c6ttQ0RjKxdm"
   },
   "outputs": [],
   "source": [
    "checkpoint_manager = restore_and_get_checkpoint_manager(\n",
    "  log_path, agent, metrics, step_metric)\n",
    "saver = policy_saver.PolicySaver(agent.policy)\n",
    "summary_writer = tf.summary.create_file_writer(log_path)\n",
    "summary_writer.set_as_default()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nPJBV8WjKxdo"
   },
   "source": [
    "Now we have all the components ready to start training the model. Here is the process for training the model:\n",
    "1. We first use the `DynamicStepDriver` instance to collect experience (trajectories) from the environment and fill up the replay buffer.\n",
    "2. We then extract all the stored experience from the replay buffer by specifying the same batch size and num_steps that we initialized the driver with. We extract it as a `tf.data.Dataset` instance.\n",
    "3. We then iterate on the dataset; the first sample we draw already contains all the data (batch_size * num_time_steps).\n",
    "4. The agent then trains on the acquired experience.\n",
    "5. The replay buffer is cleared to make space for new data.\n",
    "6. We log the metrics and store them on disk.\n",
    "7. We save the agent (via checkpoints) as well as the policy."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JYnDCTq8Zj1Q"
   },
   "source": [
    "We recommend doing the training for `15,000 loops` with 2 steps per loop, and an **agent alpha** of `10.0`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Qnc7dh5uZj1Q"
   },
   "outputs": [],
   "source": [
    "#TODO \n",
    "# Replace `None` with the above given values\n",
    "AGENT_ALPHA = None\n",
    "TRAINING_LOOPS = None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "XOM4j723Kxc2"
   },
   "source": [
    "**Note:** The training will take around 50 minutes to complete and all the data are stored in the `log_path` directory. If it takes more time then click **Kernel** > **Interrupt Kernel** and proceed further."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "hryNDsarKxdo"
   },
   "outputs": [],
   "source": [
    "## TRAINING\n",
    "#TODO: complete the training loop below (TRAINING_LOOPS is set in the cell above; we recommend 15k loops)\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "for _ in range(TRAINING_LOOPS):\n",
    "    # step 1: We first use the DynamicStepdriver instance to collect experience\n",
    "    #(trajectories) from the environment and fill up the replay buffer.\n",
    "    \n",
    " \n",
    "    # step 2: We then extract all the stored experience from the replay buffer by\n",
    "    # specifying the same batch size and num_steps that we initialized the driver with.\n",
    "    # We extract it as tf.dataset instance.\n",
    "    \n",
    "   \n",
    "    # step 3: We then iterate on the tf.dataset and the first sample we draw \n",
    "    #actually has all the data batch_size*num_time_steps\n",
    "\n",
    "    \n",
    "    # step 4: The agent then trains on the acquired experience\n",
    "    train_loss = agent.train(experience).loss\n",
    "    \n",
    "    # step 5:  the replay buffer is cleared to make space for new data\n",
    "    \n",
    "    \n",
    "    # step 6: Log the metrics and store them on disk\n",
    "    metric_utils.log_metrics(metrics)\n",
    "    for metric in metrics:\n",
    "        metric.tf_summaries(train_step=step_metric.result())\n",
    "    \n",
    "    # step 7: Save the Agent ( via checkpoints) as well as the policy\n",
    "    checkpoint_manager.save()\n",
    "    saver.save(os.path.join(log_path, 'policy_%d' % step_metric.result()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "4f77CYkOKxdv"
   },
   "source": [
    "One last task before starting the training: let's upload the TensorBoard logs to get an overview of the performance of our model. We will upload our logs to `tensorboard.dev`. To do so, \n",
    "**run the print statement below, copy the output of the cell (which is a command) into a terminal, and execute the command there. It will give you a link from which you need to copy and paste the authentication code; once that is done, you will receive the \n",
    "URL of your model evaluation, hosted on a public [tensorboard.dev](https://tensorboard.dev/) instance**. As soon as you kick off the training, you should see graphs like those in the picture below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "0nkHrPBPKxdv"
   },
   "outputs": [],
   "source": [
    "print(\"tensorboard dev upload --logdir {} --name \\\"(optional) My latest experiment\\\" --description \\\"(optional) Agent trained\\\"\".format(log_path))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "epKE4N-UZj1S"
   },
   "source": [
    "<img src='./assets/example_tensorboard.png'>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "hYIzQQySZj1R"
   },
   "source": [
    "## 7. Inference with the trained model & TensorBoard evaluation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pYlG8vJcKxdq"
   },
   "source": [
    "Now that our model is trained, how do we determine which action to take given a new \"context\"? We take the next item (an observation) and\n",
    "make a time step out of it by wrapping it in `ts.TimeStep`. It expects `step_type`, `reward`, `discount`, and `observation` as input. Since we are only performing prediction, you can fill\n",
    "in dummy values for the first three: only the observation/context is relevant. Read about how it works [here](https://www.tensorflow.org/agents/api_docs/python/tf_agents/trajectories/time_step/TimeStep), and perform the task below.\n",
    "\n",
    "The MovieLens environment provides a private `_observe` method, which randomly samples up to 8 user context observations. We select one of them and reshape it to (1, 20), the shape required for the model to consume.\n",
    "       "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "SNwKJjdpKxdr"
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "feature = np.reshape(environment._observe()[0], (1,20))\n",
    "feature.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "mv6cTWieKxds"
   },
   "outputs": [],
   "source": [
    "## Inference\n",
    "step = ts.TimeStep(\n",
    "        tf.constant(\n",
    "            ts.StepType.FIRST, dtype=tf.int32, shape=[1],\n",
    "            name='step_type'),\n",
    "        tf.constant(0.0, dtype=tf.float32, shape=[1], name='reward'),\n",
    "        tf.constant(1.0, dtype=tf.float32, shape=[1], name='discount'),\n",
    "        tf.constant(feature,\n",
    "                    dtype=tf.float64, shape=[1, 20],\n",
    "                    name='observation'))\n",
    "\n",
    "agent.policy.action(step).action.numpy()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "wVVCVmQZZj1R"
   },
   "source": [
    "The output of the cell above is the (0-indexed) number of the movie to recommend to the user (represented by the user context vector). \n",
    "Read section 2 and the documentation for more clarification around this. "
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "exercise_movielens_notebook.ipynb",
   "provenance": []
  },
  "environment": {
   "kernel": "python3",
   "name": "tf2-gpu.2-3.m91",
   "type": "gcloud",
   "uri": "gcr.io/deeplearning-platform-release/tf2-gpu.2-3:m91"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
